Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-17211: [C++] Add hash_64 scalar compute function #39836

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

drin
Copy link
Contributor

@drin drin commented Jan 29, 2024

This PR is a fresh continuation of #13487 which was closed purely in favor of this one. The reason for this is that the other PR carried a lot of "development burden" in the form of a long commit history. This fresh version maintains the current state of the code in much fewer commits with a cleaner history. Hopefully this PR will allow others to finish what I started (if I don't finish this first).

The changes still required is to add functional mechanisms for the following:

  • A TempVectorStack for local allocation within the kernel

    • This is necessary to allocate a modified structure of the input data to then be passed to the hashing functions
  • Functional testing of a kernel that can handle a nested structure (e.g. a List[List[utf-8]]).

  • Closes: [C++][Compute] Add scalar_hash function #17211

@drin drin changed the title [C++] Add hash_64 scalar compute function GH-17211: [C++] Add hash_64 scalar compute function Jan 29, 2024
@apache apache deleted a comment from github-actions bot Jan 29, 2024
@judahrand
Copy link
Contributor

@drin are you likely to get some time to push this forward? #32504 is causing us issues and I'd be great to move towards solving it!

@drin
Copy link
Contributor Author

drin commented Jun 3, 2024

sure! next week I'll have more time and I'll try to squeeze in a bit of time this week.

@drin
Copy link
Contributor Author

drin commented Jun 19, 2024

@judahrand sorry, I ended up taking a break last week, but I happened to remember this today. I will work on this issue this week and next week and my goal will be to have it finished (or close to finished) by end of next week.

I'll try to stage changes so that I can validate on a single nested structure like what is described in #32504 (list<>) and once I have some tests for that working I will try to accommodate arbitrary nesting (list<list<...>>).

@drin drin force-pushed the ARROW-8991-newfn-scalar-hash-fresh branch from 255d0be to 23558bf Compare June 28, 2024 00:09
@drin
Copy link
Contributor Author

drin commented Jun 28, 2024

rebased on main branch

@drin
Copy link
Contributor Author

drin commented Jul 3, 2024

@judahrand, I wanted to verify something with you since I assume you have an immediate use case in mind for #32504 .

The naive implementation of hashing for a "shallow" nested type (e.g. list<int8>) is relatively straightforward and the code available in this PR should be able to handle that already.

What held up this PR was logic for "deep" nested types (e.g. list<list<int8>>). To move towards addressing this, I am considering the following simple scenario and hoping to get your feedback that it is addressed reasonably (this is why it would be helpful if you have use cases in mind):

# A "shallow" nested type that we should be able to reproduce (I think)
Reference Test data [length: 3]:
[
  [ 1, 1, 2, 3,  4 ],
  [ 2, 2, 4, 6,  8 ],
  [ 3, 3, 6, 9, 12 ]
]

# A "deep" nested type that we may receive as input
Sample Test data (list<item: list<item: int64>>| [length: 3]):
[
  [ [ 1 ], [ 1, 2, 3,  4 ] ],
  [ [ 2 ], [ 2, 4, 6,  8 ] ],
  [ [ 3 ], [ 3, 6, 9, 12 ] ]
]

There is ambiguity in how a "deep" nested type should be hashed (maybe the nested structure should produce a different hash for a "row" of values), but what makes the most sense to me is for it to be value-based (the above example). In the case that a different approach is preferred, other approaches can either be encoded as future options or a UDF provided instead of the existing hashing functions.

I have prototyped a naive version of the flattening logic in a separate repo: recipe_convert.cpp#L150-L179. For context, this type of flattening would be used in the FastHashScalar::Exec function as a preprocessing step before calling the Hashing32::HashMultiColumn function.

Thanks for any feedback you can provide! Particularly on expected behavior for nested array types.

@drin
Copy link
Contributor Author

drin commented Jul 3, 2024

An alternate approach is to return a

Status::NotImplemented("Cannot hash nested data types with many levels of nesting")

if any of the columns to hash are "deep" nested types.

@judahrand
Copy link
Contributor

There is ambiguity in how a "deep" nested type should be hashed (maybe the nested structure should produce a different hash for a "row" of values), but what makes the most sense to me is for it to be value-based (the above example).

I'm not entirely sure I've understood your distinction between the two cases. Am I correct in understanding that in the first of the two options that you describe a row containing[[1], [1,2,3,4]] would have a different hash value to [[1,1], [2,3,4]] but in the second, 'value-based' option these two lists would have the same hash?

If that is the case then the 'value-based' behaviour would be surprising to me. I'd have thought that two rows with what is definitely different data should have different hashes (not considering the small chance of a hash collision).

For the case of nested listed I wonder if this issue could be avoided by including the offsets of the parent list in the hash?

@drin
Copy link
Contributor Author

drin commented Jul 5, 2024

I'm not entirely sure I've understood your distinction between the two cases. Am I correct in understanding that in the first of the two options that you describe a row containing[[1], [1,2,3,4]] would have a different hash value to [[1,1], [2,3,4]] but in the second, 'value-based' option these two lists would have the same hash?

Oh yeah, I guess I didn't make that part clear. But yes, you understood correctly. I was in favor of [[1], [1, 2, 3, 4]] having the same hash as [[1, 1], [2, 3, 4]] (what I called the "value-based" option).

If you want those to have different hashes, then it may be possible to transform the list array into a struct array and one struct column can be all of the offsets coalesced and a second struct column can be all of the values (referenced). To do it any other way would require extending the existing hashing functions in non-trivial ways or adding new hashing functions entirely.

So, maybe for this PR I can simply do shallow nested arrays and someone can work on the deep nested arrays in aa follow-up PR.

This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
@drin drin force-pushed the ARROW-8991-newfn-scalar-hash-fresh branch from 23558bf to b2af8c9 Compare October 7, 2024 22:18
This is primarily to make the next commit a bit more focused.
Previous commits tried to add allocation mechanisms to KeyColumnArray to
accommodate arbitrarily nested types. While some allocation is necessary
to accommodate arbitrarily nested types, the work is being split into
incremental steps and scope is being reduced to the straightforward
nested types.

Specifically, we want to support: ListArray, StructArray, and MapArray.
We **only** support these types if they contain primitive types (e.g.
List<int8> not List<List<int8>>, or Map<int64, float32> not Map<string,
int32>).

The difficulty with nested types beyond the first level is that multiple
buffers of offsets must be accommodated and that support requires
changes that either: (1) will propagate through key_hash code or (2)
requires a (temporary) allocation of an array that flattens offsets into
a single offset buffer.
Needed to add a check when constructing a KeyColumnArray for a ListArray
that the element type of the ListArray is primitive.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Compute] Add scalar_hash function
2 participants